In this file, we will perform compositional analysis of our scaled/noramlized/rarefied microbial dataset!
scaled_physeq.RData, which includes a
rooted tree that we created in
analysis/06_Ordination/scaled_physeq.RData.Microbial abundance data—like 16S rRNA gene or metagenomics data—are typically compositional: they represent relative abundances constrained to a constant total (e.g., percent or proportions). This introduces spurious correlations and other issues if analyzed with traditional statistics. This is a very important limitation to microbial data!
Interpretation 1: Interpreting microbial abundance from relative (aka. compositional) rather than absolute counts does not allow data to be compared between studies. Additionally, since all abundances are relative, when one abundance increases another one must decrease, even if the absolute abundance did not actually change. This can introduce false positives.
pacman::p_load(tidyverse, devtools, DT, phyloseq, patchwork, install = FALSE)
# load colors
source("code/colors.R")
The following phyloseq object contains microbial community composition data in a standardized format. In this case, we’ve already normalized the reads (scaled to 17,323 per sample), which is essential for comparing relative abundances.
#load physeq
load("data/06_Ordination/scaled_physeq.RData")
# look at the data
scaled_physeq
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 6060 taxa and 43 samples ]
## sample_data() Sample Data: [ 43 samples by 38 sample variables ]
## tax_table() Taxonomy Table: [ 6060 taxa by 9 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 6060 tips and 6059 internal nodes ]
# Intuition check - scaled at 17,323
min(sample_sums(scaled_physeq))
## [1] 17323
range(sample_sums(scaled_physeq))
## [1] 17323 17404
In this analysis, we will drill down from phylum to ASV level analyses, which will enable increasingly detailed insights into microbial diversity and potential ecological roles. However, it is also important for us to remember that deeper levels also come with increased noise and data sparsity, especially for rare groups.
Taxonomic analysis often begins at broader levels (e.g., Phylum) to visualize overarching trends before zooming in on finer-scale patterns. This step allows us to identify which microbial groups dominate across samples and which may respond to environmental gradients like salinity.
Now, let’s calculate the relative abundance of the phyla across all the samples. NOTE: To do this, it is imperative that we have scaled the data to a constant sequencing depth.
#Create a phylum level datafram
phylum_df <-
scaled_physeq %>%
# Agglomerate all ASV counts within a phylum
tax_glom(taxrank = "Phylum") %>% # We have 31 Phyla
# Calculate relative abundance !
transform_sample_counts(function(x) {x/sum(x)}) %>%
#create df from phyloseq object
psmelt() %>%
# Filter out phyla < 1%
dplyr::filter(Abundance > 0.01) %>%
# fix the order of age sample site
mutate(SampleSite = fct_relevel(SampleSite, c("CLOACA", "ORAL", "TANK WATER")),
AgeRange = fct_relevel(AgeRange, c("JUVENILE", "SUB-ADULT",
"ADULT")))
## What are the phylum abundances?
phylum_df %>%
group_by(Phylum) %>%
summarize(mean_PercAbund = round(mean(Abundance), digits = 4)) %>%
arrange(-mean_PercAbund) %>%
datatable()
# Make a list of the top phyla
top10_phyla <-
phylum_df %>%
group_by(Phylum) %>%
summarize(mean_PercAbund = mean(Abundance)) %>%
arrange(-mean_PercAbund) %>%
head(n = 10) %>%
pull(Phylum)
# intuition check - should be the same
top10_phyla
## [1] "Pseudomonadota" "Bacteroidota"
## [3] "Bacillota" "Dependentiae"
## [5] "Spirochaetota" "Balneolota"
## [7] "Verrucomicrobiota" "Thermodesulfobacteriota"
## [9] "Myxococcota" "Planctomycetota"
Interpretation 2: Pseudomonadota, Bacteroidota, and Bacillota have the highest relative abundance in my dataset. Their abundaces are 58%, 37%, and 12%, respectively.
Visualization helps detect patterns in composition and abundance that statistical models may later test. Stacked bar plots are often used but can obscure individual sample variation and visualize too much data all at once.
Therefore, we will also explore faceted and jittered boxplots below the bar plots to see sample-level trends more clearly.
# Stacked Bar Plot With All phyla
# Plot Phylum Abundances - make sure to load phylum_colors
phylum_df %>%
dplyr::filter(Phylum %in% top10_phyla) %>%
# Warning: It's important to have one sample per x value,
# otherwise, it will take the sum between multiple samples
# chose one turtle for juvenile, sub-adult, and adult, but I don't think this looks right
dplyr::filter(TurtleName == c("MARVIN", "MARO", "SAMBA")) %>%
ggplot(aes(x = SampleSite, y = Abundance, fill = Phylum)) +
facet_grid(.~AgeRange) +
geom_bar(stat = "identity", color = "black") +
labs(title = "Top 10 Phyla: Surface Samples") +
scale_fill_manual(values = phylum_colors) +
theme_bw() +
theme(axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1))
#Relative abundance needs to be between 0 and 1, or else you're plotting more than one sample
# Stacked Bar Plot looking at all samples and grouping by age range
# Plot Phylum Abundances - make sure to load phylum_colors
phylum_df %>%
dplyr::filter(Phylum %in% top10_phyla) %>%
ggplot(aes(x = Sample, y = Abundance, fill = Phylum)) +
facet_grid(. ~ AgeRange, scales = "free_x", space = "free_x") +
geom_bar(stat = "identity", color = "black") +
labs(title = "Top 10 Phyla per Turtle") +
scale_fill_manual(values = phylum_colors) +
theme_bw() +
theme(axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1),
strip.background = element_blank(),
panel.spacing = unit(1, "lines"))
# Stacked Bar Plot looking at all samples and grouping by sample site
# Plot Phylum Abundances - make sure to load phylum_colors
phylum_df %>%
dplyr::filter(Phylum %in% top10_phyla) %>%
ggplot(aes(x = Sample, y = Abundance, fill = Phylum)) +
facet_grid(. ~ SampleSite, scales = "free_x", space = "free_x") +
geom_bar(stat = "identity", color = "black") +
labs(title = "Top 10 Phyla per Turtle") +
scale_fill_manual(values = phylum_colors) +
theme_bw() +
theme(axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1),
strip.background = element_blank(),
panel.spacing = unit(1, "lines"))
# Stacked Bar Plot looking at all samples and grouping by age AND sample site
# Plot Phylum Abundances - make sure to load phylum_colors
phylum_df %>%
filter(Phylum %in% top10_phyla) %>%
ggplot(aes(x = Sample, y = Abundance, fill = Phylum)) +
facet_grid(SampleSite ~ AgeRange, scales = "free_x", space = "free_x") +
geom_bar(stat = "identity", color = "black") +
labs(title = "Top 10 Phyla per Turtle by Sample Site and Age Range") +
scale_fill_manual(values = phylum_colors) +
theme_bw() +
theme(axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1),
strip.background = element_blank(),
panel.spacing = unit(1, "lines"))
To help compare the phylum abundance between sample types, we can facet by phylum to better see how the changes occur across ages, which is masked in the stacked bar plot. It’s a little better than the stacked bar plot, however, we can do even better!
# Group by Age
phylum_df %>%
dplyr::filter(Phylum %in% top10_phyla) %>%
ggplot(aes(x = Sample, y = Abundance, fill = Phylum)) +
facet_grid(Phylum~AgeRange, scale = "free") +
# add the stacked bar
geom_bar(stat = "identity", color = "black") +
# change the colors to be our selected colors
scale_fill_manual(values = phylum_colors) +
theme_bw() +
theme(legend.position = "none",
axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1))
# Group by Sample Site
phylum_df %>%
dplyr::filter(Phylum %in% top10_phyla) %>%
ggplot(aes(x = Sample, y = Abundance, fill = Phylum)) +
facet_grid(Phylum~SampleSite, scale = "free") +
# add the stacked bar
geom_bar(stat = "identity", color = "black") +
# change the colors to be our selected colors
scale_fill_manual(values = phylum_colors) +
theme_bw() +
theme(legend.position = "none",
axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1))
### Or combined together:
phylum_df %>%
dplyr::filter(Phylum %in% top10_phyla) %>%
ggplot(aes(x = AgeRange, y = Abundance, fill = Phylum, color = Phylum)) +
facet_grid(Phylum~SampleSite, scale = "free") +
# add the stacked bar
geom_jitter() +
geom_boxplot(outlier.shape = NA, alpha = 0.5) +
# change the colors to be our selected colors
scale_fill_manual(values = phylum_colors) +
scale_color_manual(values = phylum_colors) +
theme_bw() +
theme(legend.position = "none",
axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1))
### Or combined together:
phylum_df %>%
dplyr::filter(Phylum %in% top10_phyla) %>%
ggplot(aes(x = AgeRange, y = Abundance, fill = Phylum, color = Phylum)) +
facet_wrap(Phylum~., scales = "free", nrow = 2) +
# add the stacked bar
geom_jitter() +
geom_boxplot(outlier.shape = NA, alpha = 0.5) +
# change the colors to be our selected colors
scale_fill_manual(values = phylum_colors) +
scale_color_manual(values = phylum_colors) +
theme_bw() +
theme(legend.position = "none",
axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1))
phylum_df %>%
dplyr::filter(Phylum %in% top10_phyla) %>%
dplyr::filter(SampleSite == "CLOACA") %>%
ggplot(aes(x = AgeRange, y = Abundance, fill = Phylum, color = Phylum)) +
facet_wrap(Phylum~., scales = "free", nrow = 2) +
# add the stacked bar
geom_jitter() +
geom_boxplot(outlier.shape = NA, alpha = 0.5) +
# change the colors to be our selected colors
scale_fill_manual(values = phylum_colors) +
scale_color_manual(values = phylum_colors) +
theme_bw() +
theme(legend.position = "none",
axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1))
phylum_df %>%
dplyr::filter(Phylum %in% top10_phyla) %>%
dplyr::filter(SampleSite == "ORAL") %>%
ggplot(aes(x = AgeRange, y = Abundance, fill = Phylum, color = Phylum)) +
facet_wrap(Phylum~., scales = "free", nrow = 2) +
# add the stacked bar
geom_jitter() +
geom_boxplot(outlier.shape = NA, alpha = 0.5) +
# change the colors to be our selected colors
scale_fill_manual(values = phylum_colors) +
scale_color_manual(values = phylum_colors) +
theme_bw() +
theme(legend.position = "none",
axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1))
phylum_df %>%
dplyr::filter(Phylum %in% top10_phyla) %>%
dplyr::filter(SampleSite == "TANK WATER") %>%
ggplot(aes(x = AgeRange, y = Abundance, fill = Phylum, color = Phylum)) +
facet_wrap(Phylum~., scales = "free", nrow = 2) +
# add the stacked bar
geom_jitter() +
geom_boxplot(outlier.shape = NA, alpha = 0.5) +
# change the colors to be our selected colors
scale_fill_manual(values = phylum_colors) +
scale_color_manual(values = phylum_colors) +
theme_bw() +
theme(legend.position = "none",
axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1))
### Or combined together:
phylum_df %>%
dplyr::filter(Phylum %in% top10_phyla) %>%
ggplot(aes(x = SampleSite, y = Abundance, fill = Phylum, color = Phylum)) +
facet_wrap(Phylum~., scales = "free", nrow = 2) +
# add the stacked bar
geom_jitter() +
geom_boxplot(outlier.shape = NA, alpha = 0.5) +
# change the colors to be our selected colors
scale_fill_manual(values = phylum_colors) +
scale_color_manual(values = phylum_colors) +
theme_bw() +
theme(legend.position = "none",
axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1))
notes - some phyla unique to adults or sub-adults? (Dependentiae, Thermodesulfobacteriota, Balneolota) - some phyla shared between ages (Bacteroidota, Pseudomonadota) - some phyla differ across ages (sub-adult Verrucomicrobiota)
After initial exploration, we focus on specific phyla that appear to vary across ages. These targeted plots help develop hypotheses about ecological drivers.
# Narrow in on a specific group
# Verrucomicrobiota - y: abundance, x: age, dot plot + boxplot
verrucom_phylum_AgeRange <-
phylum_df %>%
dplyr::filter(Phylum == "Verrucomicrobiota") %>%
# build the plot
ggplot(aes(x = AgeRange, y = Abundance,
fill = AgeRange, color = AgeRange)) +
geom_boxplot(alpha = 0.5, outlier.shape = NA) + # outliers not plotted here in boxplot
geom_jitter() +
theme_bw() +
labs(title = "Verrucomicrobiota Phylum") +
scale_color_manual(values = AgeRange_colors) +
scale_fill_manual(values = AgeRange_colors) +
theme(axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1),
legend.position = "right")
# Separate by Sample Site
phylum_df %>%
filter(Phylum == "Verrucomicrobiota") %>%
ggplot(aes(x = AgeRange, y = Abundance, fill = AgeRange, color = AgeRange)) +
geom_boxplot(alpha = 0.5, outlier.shape = NA) +
geom_jitter(width = 0.2, alpha = 0.6) +
facet_wrap(~ SampleSite) +
#stat_compare_means(method = "kruskal.test", label = "p.format") +
scale_color_manual(values = AgeRange_colors) +
scale_fill_manual(values = AgeRange_colors) +
labs(title = "Verrucomicrobiota Phylum by Sample Site and Age Range") +
theme_bw() +
theme(axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1),
legend.position = "right")
# Statistically: Kruskall-Wallis followed by a Tukey's Posthoc test
# These are non-parametric (non-normal) stat tests
# CONTINUOUS
verrucom_phylum_weight <-
phylum_df %>%
dplyr::filter(Phylum == "Verrucomicrobiota") %>%
ggplot(aes(x = Weight, y = Abundance)) +
geom_point(aes(color = AgeRange)) +
theme_bw() +
geom_smooth(method = "lm", formula = y ~ poly(x, 2)) +
labs(title = "Verrucomicrobiota Phylum") +
scale_color_manual(values = AgeRange_colors) +
theme(legend.position = "none")
# Collect both of the plots together into one
verrucom_phylum_AgeRange + verrucom_phylum_weight +
plot_layout(guides = "collect") +
plot_annotation(tag_levels = "A")
Def more in juveniles when looking at Cloaca Possibly more in sub-adults when looking at Oral
# Narrow in on a specific group
# Myxococcota - y: abundance, x: age, dot plot + boxplot
myxoc_phylum_AgeRange <-
phylum_df %>%
dplyr::filter(Phylum == "Myxococcota") %>%
# build the plot
ggplot(aes(x = AgeRange, y = Abundance,
fill = AgeRange, color = AgeRange)) +
geom_boxplot(alpha = 0.5, outlier.shape = NA) + # outliers not plotted here in boxplot
geom_jitter() +
theme_bw() +
labs(title = "Myxococcota Phylum") +
scale_color_manual(values = AgeRange_colors) +
scale_fill_manual(values = AgeRange_colors) +
theme(axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1),
legend.position = "right")
# Statistically: Kruskall-Wallis followed by a Tukey's Posthoc test
# These are non-parametric (non-normal) stat tests
# Separate by Sample Site and test with KW
phylum_df %>%
filter(Phylum == "Myxococcota") %>%
ggplot(aes(x = AgeRange, y = Abundance, fill = AgeRange, color = AgeRange)) +
geom_boxplot(alpha = 0.5, outlier.shape = NA) +
geom_jitter(width = 0.2, alpha = 0.6) +
facet_wrap(~ SampleSite) +
#stat_compare_means(method = "kruskal.test", label = "p.format") +
scale_color_manual(values = AgeRange_colors) +
scale_fill_manual(values = AgeRange_colors) +
labs(title = "Myxococcota Phylum by Sample Site and Age Range") +
theme_bw() +
theme(axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1),
legend.position = "right")
# CONTINUOUS
myxoc_phylum_weight <-
phylum_df %>%
dplyr::filter(Phylum == "Myxococcota") %>%
ggplot(aes(x = Weight, y = Abundance)) +
geom_point(aes(color = AgeRange)) +
theme_bw() +
geom_smooth(method = "lm", formula = y ~ poly(x, 2)) +
labs(title = "Myxococcota Phylum") +
scale_color_manual(values = AgeRange_colors) +
theme(legend.position = "none")
# Collect both of the plots together into one
myxoc_phylum_AgeRange + myxoc_phylum_weight +
plot_layout(guides = "collect") +
plot_annotation(tag_levels = "A")
lower in sub-adults?
# Narrow in on a specific group
# Planctomycetota - y: abundance, x: age, dot plot + boxplot
plant_phylum_AgeRange <-
phylum_df %>%
dplyr::filter(Phylum == "Planctomycetota") %>%
# build the plot
ggplot(aes(x = AgeRange, y = Abundance,
fill = AgeRange, color = AgeRange)) +
geom_boxplot(alpha = 0.5, outlier.shape = NA) + # outliers not plotted here in boxplot
geom_jitter() +
theme_bw() +
labs(title = "Planctomycetota Phylum") +
scale_color_manual(values = AgeRange_colors) +
scale_fill_manual(values = AgeRange_colors) +
theme(axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1),
legend.position = "right")
# Statistically: Kruskall-Wallis followed by a Tukey's Posthoc test
# These are non-parametric (non-normal) stat tests
# Separate by Sample Site and test with KW
phylum_df %>%
filter(Phylum == "Planctomycetota") %>%
ggplot(aes(x = AgeRange, y = Abundance, fill = AgeRange, color = AgeRange)) +
geom_boxplot(alpha = 0.5, outlier.shape = NA) +
geom_jitter(width = 0.2, alpha = 0.6) +
facet_wrap(~ SampleSite) +
#stat_compare_means(method = "kruskal.test", label = "p.format") +
scale_color_manual(values = AgeRange_colors) +
scale_fill_manual(values = AgeRange_colors) +
labs(title = "Planctomycetota Phylum by Sample Site and Age Range") +
theme_bw() +
theme(axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1),
legend.position = "right")
# CONTINUOUS
plant_phylum_weight <-
phylum_df %>%
dplyr::filter(Phylum == "Planctomycetota") %>%
ggplot(aes(x = Weight, y = Abundance)) +
geom_point(aes(color = AgeRange)) +
theme_bw() +
geom_smooth(method = "lm", formula = y ~ poly(x, 2)) +
labs(title = "Planctomycetota Phylum") +
scale_color_manual(values = AgeRange_colors) +
theme(legend.position = "none")
# Collect both of the plots together into one
plant_phylum_AgeRange + plant_phylum_weight +
plot_layout(guides = "collect") +
plot_annotation(tag_levels = "A")
not rly different
# Narrow in on a specific group
# Bacteroidota - y: abundance, x: age, dot plot + boxplot
bacter_phylum_AgeRange <-
phylum_df %>%
dplyr::filter(Phylum == "Bacteroidota") %>%
# build the plot
ggplot(aes(x = AgeRange, y = Abundance,
fill = AgeRange, color = AgeRange)) +
geom_boxplot(alpha = 0.5, outlier.shape = NA) + # outliers not plotted here in boxplot
geom_jitter() +
theme_bw() +
labs(title = "Bacteroidota Phylum") +
scale_color_manual(values = AgeRange_colors) +
scale_fill_manual(values = AgeRange_colors) +
theme(axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1),
legend.position = "right")
# Statistically: Kruskall-Wallis followed by a Tukey's Posthoc test
# These are non-parametric (non-normal) stat tests
# Separate by Sample Site and test with KW
phylum_df %>%
filter(Phylum == "Bacteroidota") %>%
ggplot(aes(x = AgeRange, y = Abundance, fill = AgeRange, color = AgeRange)) +
geom_boxplot(alpha = 0.5, outlier.shape = NA) +
geom_jitter(width = 0.2, alpha = 0.6) +
facet_wrap(~ SampleSite) +
#stat_compare_means(method = "kruskal.test", label = "p.format") +
scale_color_manual(values = AgeRange_colors) +
scale_fill_manual(values = AgeRange_colors) +
labs(title = "Bacteroidota Phylum by Sample Site and Age Range") +
theme_bw() +
theme(axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1),
legend.position = "right")
# CONTINUOUS
bacter_phylum_weight <-
phylum_df %>%
dplyr::filter(Phylum == "Bacteroidota") %>%
ggplot(aes(x = Weight, y = Abundance)) +
geom_point(aes(color = AgeRange)) +
theme_bw() +
geom_smooth(method = "lm", formula = y ~ poly(x, 2)) +
labs(title = "Bacteroidota Phylum") +
scale_color_manual(values = AgeRange_colors) +
theme(legend.position = "none")
# Collect both of the plots together into one
bacter_phylum_AgeRange + bacter_phylum_weight +
plot_layout(guides = "collect") +
plot_annotation(tag_levels = "A")
Not rly a difference
Interpretation 4: I would most want to look into Verrucomicrobiota and Bacteroidota.
Let’s first calculate the genus data frame.
# Calculate the Family relative abundance
# Note: The read depth MUST be normalized in some way: scale_reads
genus_df <-
scaled_physeq %>%
# agglomerate at the phylum level
tax_glom(taxrank = "Genus") %>%
# Transform counts to relative abundance
transform_sample_counts(function (x) {x/sum(x)}) %>%
# Melt to a long format
psmelt() %>%
# fix the order of sample site
mutate(SampleSite = fct_relevel(SampleSite, c("CLOACA", "ORAL", "TANK WATER")),
AgeRange = fct_relevel(AgeRange, c("JUVENILE", "SUB-ADULT",
"ADULT")))
# Verrucomicrobiota
# Plot genus
verr_genus_AgeRange <-
genus_df %>%
dplyr::filter(Phylum == "Verrucomicrobiota") %>%
dplyr::filter(SampleSite == "CLOACA") %>%
# At first, plot all of the genera and then subset the ones that have intersting trends
dplyr::filter(Genus %in% c("Rubritalea", "Diplosphaera")) %>%
# build the plot
ggplot(aes(x = AgeRange, y = Abundance,
fill = AgeRange, color = AgeRange)) +
facet_wrap(.~Genus, scales = "free_y", nrow = 1) +
geom_boxplot(alpha = 0.5, outlier.shape = NA) + # outliers not plotted here in boxplot
geom_jitter() +
theme_bw() +
labs(title = "Verrucomicrobiota Genera in Cloaca") +
scale_color_manual(values = AgeRange_colors) +
scale_fill_manual(values = AgeRange_colors) +
theme(axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1),
legend.position = "none")
# Plot genus: Continuous
verr_genus_weight <-
genus_df %>%
dplyr::filter(Phylum == "Verrucomicrobiota") %>%
dplyr::filter(SampleSite == "CLOACA") %>%
dplyr::filter(Genus %in% c("Rubritalea", "Diplosphaera")) %>%
# build the plot
ggplot(aes(x = Weight, y = Abundance)) +
facet_wrap(.~Genus, scales = "free_y", nrow = 1) +
geom_point(aes(color = AgeRange)) + theme_bw() +
geom_smooth(method = "lm", formula = y ~ poly(x, 2)) +
labs(title = "Verrucomicrobiota Genera in Cloaca") +
scale_color_manual(values = AgeRange_colors) +
theme(legend.position = "none")
# Collect the Actinomycetota Plots
verr_genus_AgeRange / verr_genus_weight +
plot_layout(guides = "collect") +
plot_annotation(tag_levels = "A")
There’s rly no difference
# Bacteroidota
# Plot genus
bacter_genus_AgeRange <-
genus_df %>%
dplyr::filter(Phylum == "Bacteroidota") %>%
dplyr::filter(SampleSite == "CLOACA") %>%
# At first, plot all of the genera and then subset the ones that have intersting trends
dplyr::filter(Genus %in% c("Tenacibaculum", "Bacteroides")) %>%
# build the plot
ggplot(aes(x = AgeRange, y = Abundance,
fill = AgeRange, color = AgeRange)) +
facet_wrap(.~Genus, scales = "free_y", nrow = 1) +
geom_boxplot(alpha = 0.5, outlier.shape = NA) + # outliers not plotted here in boxplot
geom_jitter() +
theme_bw() +
labs(title = "Bacteroidota Genera in Cloaca") +
scale_color_manual(values = AgeRange_colors) +
scale_fill_manual(values = AgeRange_colors) +
theme(axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1),
legend.position = "none")
# Plot genus: Continuous
bacter_genus_weight <-
genus_df %>%
dplyr::filter(Phylum == "Bacteroidota") %>%
dplyr::filter(SampleSite == "CLOACA") %>%
dplyr::filter(Genus %in% c("Tenacibaculum", "Bacteroides")) %>%
# build the plot
ggplot(aes(x = Weight, y = Abundance)) +
facet_wrap(.~Genus, scales = "free_y", nrow = 1) +
geom_point(aes(color = AgeRange)) + theme_bw() +
geom_smooth(method = "lm", formula = y ~ poly(x, 2)) +
labs(title = "Bacteroidota Genera in Cloaca") +
scale_color_manual(values = AgeRange_colors) +
theme(legend.position = "none")
# Collect the Bacteroidota Plots
bacter_genus_AgeRange / bacter_genus_weight +
plot_layout(guides = "collect") +
plot_annotation(tag_levels = "A")
Now, let’s take a look at the ASVs. This is where the real ecology/biology happens! This is because the ASV-level plots will provide us with a more detailed view of which specific taxa (ASVs) are driving the overall trends seen at the higher taxonomic levels (e.g. phylum) in relation to salinity gradients and station differences.
Before we calculate the ASV-level abundances, let’s take a second to think about who we’d like to consider. There’s a lot of ASVs! In fact, there are 6060! It is difficult analyze all of them! Therefore, we will remove ASVs that have an overall count across the entire dataset of 1,732 or 1% of a scaled sample to 17,323. This is a little arbitrary and therefore, you may choose a different threshold for your dataset. The goal here is to dramatically decreases the number of ASVs in the dataset to help make the ASV-level data analysis much easier.
Of course, if we had more time in class, we could run differential abundance and this would also help us to identify and statistically test which ASVs are important and also help us narrow on specific ASVs.
The goal with this lesson is to do some data exploration and then also test ASVs that are relevant for our specific study. Let’s go!
# Calculate the Family relative abundance
# Note: The read depth MUST be normalized in some way: scale_reads
ASV_df <-
scaled_physeq %>%
# Prune out ASVs that have fewer than 100 counts!
## LOOK AT HOW MANY ARE REMOVED! We scaled to 17,323 reads! :O
prune_taxa(taxa_sums(.) >= 1732, .) %>%
# agglomerate at the phylum level
tax_glom(taxrank = "ASV") %>%
# Transform counts to relative abundance
transform_sample_counts(function (x) {x/sum(x)}) %>%
# Melt to a long format
psmelt() %>%
# fix the order of age sample site
mutate(SampleSite = fct_relevel(SampleSite, c("CLOACA", "ORAL", "TANK WATER")),
AgeRange = fct_relevel(AgeRange, c("JUVENILE", "SUB-ADULT",
"ADULT")))
There are only Bacillota and Bacteroidota in this dataframe, so I cannot investigate the others.
# Calculate top couple of ASVs
# Make a list of phyla the top phyla
top_bacil_ASVs <-
ASV_df %>%
dplyr::filter(Phylum == "Bacillota") %>%
group_by(ASV) %>%
summarize(mean_Abundance = mean(Abundance)) %>%
dplyr::filter(mean_Abundance > 0.005) %>%
pull(ASV)
# Bacillota
# Plot ASVs
bacil_asv_AgeRange <-
ASV_df %>%
dplyr::filter(ASV %in% top_bacil_ASVs) %>%
# build the plot
ggplot(aes(x = AgeRange, y = Abundance,
fill = AgeRange, color = AgeRange)) +
facet_wrap(Genus~ASV, scales = "free_y", nrow = 2) +
geom_boxplot(alpha = 0.5, outlier.shape = NA) + # outliers not plotted here in boxplot
geom_jitter() +
theme_bw() +
labs(title = "Bacillota ASVs") +
scale_color_manual(values = AgeRange_colors) +
scale_fill_manual(values = AgeRange_colors) +
theme(axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1),
legend.position = "none")
# Plot ASVs: Continuous
bacil_asv_weight <-
ASV_df %>%
dplyr::filter(ASV %in% top_bacil_ASVs) %>%
# build the plot
ggplot(aes(x = Weight, y = Abundance)) +
facet_wrap(Genus~ASV, scales = "free_y", nrow = 2) +
geom_point(aes(color = AgeRange)) + theme_bw() +
geom_smooth(method = "lm", formula = y ~ poly(x, 2)) +
labs(title = "Bacillota ASVs") +
scale_color_manual(values = AgeRange_colors) +
theme(legend.position = "none")
# Collect the Bacillota Plots
bacil_asv_AgeRange / bacil_asv_weight +
plot_layout(guides = "collect") +
plot_annotation(tag_levels = "A")
there’s rly no difference
# Calculate top couple of ASVs
# Make a list of phyla the top phyla
top_bacter_ASVs <-
ASV_df %>%
dplyr::filter(Phylum == "Bacteroidota") %>%
group_by(ASV) %>%
summarize(mean_Abundance = mean(Abundance)) %>%
dplyr::filter(mean_Abundance > 0.005) %>%
pull(ASV)
# Bacteroidota
# Plot ASVs
bacter_asv_AgeRange <-
ASV_df %>%
dplyr::filter(ASV %in% top_bacter_ASVs) %>%
# build the plot
ggplot(aes(x = AgeRange, y = Abundance,
fill = AgeRange, color = AgeRange)) +
facet_wrap(Genus~ASV, scales = "free_y", nrow = 2) +
geom_boxplot(alpha = 0.5, outlier.shape = NA) + # outliers not plotted here in boxplot
geom_jitter() +
theme_bw() +
labs(title = "Bacteroidota ASVs") +
scale_color_manual(values = AgeRange_colors) +
scale_fill_manual(values = AgeRange_colors) +
theme(axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1),
legend.position = "none")
# Plot ASVs: Continuous
bacter_asv_weight <-
ASV_df %>%
dplyr::filter(ASV %in% top_bacter_ASVs) %>%
# build the plot
ggplot(aes(x = Weight, y = Abundance)) +
facet_wrap(Genus~ASV, scales = "free_y", nrow = 2) +
geom_point(aes(color = AgeRange)) + theme_bw() +
geom_smooth(method = "lm", formula = y ~ poly(x, 2)) +
labs(title = "Bacillota ASVs") +
scale_color_manual(values = AgeRange_colors) +
theme(legend.position = "none")
# Collect the Bacteroidota Plots
bacter_asv_AgeRange / bacter_asv_weight +
plot_layout(guides = "collect") +
plot_annotation(tag_levels = "A")
# choose specific ASVs to make graphs easier to see
# Plot ASVs
bacter_asv_AgeRange_filtered <-
ASV_df %>%
dplyr::filter(ASV %in% top_bacter_ASVs) %>%
dplyr::filter(SampleSite == "CLOACA") %>%
dplyr::filter(ASV %in% c("ASV_0010", "ASV_0031", "ASV_0044", "ASV_0083", "ASV_0074")) %>%
# build the plot
ggplot(aes(x = AgeRange, y = Abundance,
fill = AgeRange, color = AgeRange)) +
facet_wrap(Genus~ASV, scales = "free_y", nrow = 2) +
geom_boxplot(alpha = 0.5, outlier.shape = NA) + # outliers not plotted here in boxplot
geom_jitter() +
theme_bw() +
labs(title = "Bacteroidota ASVs") +
scale_color_manual(values = AgeRange_colors) +
scale_fill_manual(values = AgeRange_colors) +
theme(axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1),
legend.position = "none")
# Plot ASVs: Continuous
bacter_asv_weight_filtered <-
ASV_df %>%
dplyr::filter(ASV %in% top_bacter_ASVs) %>%
dplyr::filter(SampleSite == "CLOACA") %>%
dplyr::filter(ASV %in% c("ASV_0010", "ASV_0031", "ASV_0044", "ASV_0083", "ASV_0074")) %>%
# build the plot
ggplot(aes(x = Weight, y = Abundance)) +
facet_wrap(Genus~ASV, scales = "free_y", nrow = 2) +
geom_point(aes(color = AgeRange)) + theme_bw() +
geom_smooth(method = "lm", formula = y ~ poly(x, 2)) +
labs(title = "Bacillota ASVs") +
scale_color_manual(values = AgeRange_colors) +
theme(legend.position = "none")
# Collect the Bacteroidota Plots
bacter_asv_AgeRange_filtered / bacter_asv_weight_filtered +
plot_layout(guides = "collect") +
plot_annotation(tag_levels = "A")
ASV_0002 gained with age
ASV_0097 lost with age
ASV_0055 completely gone in adults
Potentially slight trend of losing ASVs with age
Interpretation 6: replace
Interpretation 7: replace
Interpretation 8: replace
For reproducibility
devtools::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R version 4.2.3 (2023-03-15)
## os Rocky Linux 9.4 (Blue Onyx)
## system x86_64, linux-gnu
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz America/New_York
## date 2025-04-29
## pandoc 3.1.1 @ /usr/lib/rstudio-server/bin/quarto/bin/tools/ (via rmarkdown)
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date (UTC) lib source
## ade4 1.7-23 2025-02-14 [1] CRAN (R 4.2.3)
## ape 5.8-1 2024-12-16 [1] CRAN (R 4.2.3)
## Biobase 2.58.0 2022-11-01 [2] Bioconductor
## BiocGenerics 0.44.0 2022-11-01 [2] Bioconductor
## biomformat 1.26.0 2022-11-01 [1] Bioconductor
## Biostrings 2.66.0 2022-11-01 [2] Bioconductor
## bitops 1.0-9 2024-10-03 [2] CRAN (R 4.2.3)
## bslib 0.8.0 2024-07-29 [2] CRAN (R 4.2.3)
## cachem 1.1.0 2024-05-16 [2] CRAN (R 4.2.3)
## cli 3.6.4 2025-02-13 [1] CRAN (R 4.2.3)
## cluster 2.1.4 2022-08-22 [2] CRAN (R 4.2.3)
## codetools 0.2-19 2023-02-01 [2] CRAN (R 4.2.3)
## colorspace 2.1-1 2024-07-26 [2] CRAN (R 4.2.3)
## crayon 1.5.3 2024-06-20 [2] CRAN (R 4.2.3)
## crosstalk 1.2.1 2023-11-23 [2] CRAN (R 4.2.3)
## data.table 1.16.2 2024-10-10 [2] CRAN (R 4.2.3)
## devtools * 2.4.5 2022-10-11 [1] CRAN (R 4.2.3)
## digest 0.6.37 2024-08-19 [2] CRAN (R 4.2.3)
## dplyr * 1.1.4 2023-11-17 [2] CRAN (R 4.2.3)
## DT * 0.33 2024-04-04 [1] CRAN (R 4.2.3)
## ellipsis 0.3.2 2021-04-29 [2] CRAN (R 4.2.3)
## evaluate 1.0.1 2024-10-10 [2] CRAN (R 4.2.3)
## farver 2.1.2 2024-05-13 [2] CRAN (R 4.2.3)
## fastmap 1.2.0 2024-05-15 [2] CRAN (R 4.2.3)
## forcats * 1.0.0 2023-01-29 [2] CRAN (R 4.2.3)
## foreach 1.5.2 2022-02-02 [1] CRAN (R 4.2.3)
## fs 1.6.5 2024-10-30 [2] CRAN (R 4.2.3)
## generics 0.1.3 2022-07-05 [2] CRAN (R 4.2.3)
## GenomeInfoDb 1.34.9 2023-02-02 [2] Bioconductor
## GenomeInfoDbData 1.2.9 2024-11-25 [2] Bioconductor
## ggplot2 * 3.5.1 2024-04-23 [2] CRAN (R 4.2.3)
## glue 1.8.0 2024-09-30 [2] CRAN (R 4.2.3)
## gtable 0.3.6 2024-10-25 [2] CRAN (R 4.2.3)
## hms 1.1.3 2023-03-21 [2] CRAN (R 4.2.3)
## htmltools 0.5.8.1 2024-04-04 [2] CRAN (R 4.2.3)
## htmlwidgets 1.6.4 2023-12-06 [2] CRAN (R 4.2.3)
## httpuv 1.6.15 2024-03-26 [2] CRAN (R 4.2.3)
## igraph 2.1.1 2024-10-19 [2] CRAN (R 4.2.3)
## IRanges 2.32.0 2022-11-01 [2] Bioconductor
## iterators 1.0.14 2022-02-05 [1] CRAN (R 4.2.3)
## jquerylib 0.1.4 2021-04-26 [2] CRAN (R 4.2.3)
## jsonlite 1.8.9 2024-09-20 [2] CRAN (R 4.2.3)
## knitr 1.49 2024-11-08 [2] CRAN (R 4.2.3)
## labeling 0.4.3 2023-08-29 [2] CRAN (R 4.2.3)
## later 1.3.2 2023-12-06 [2] CRAN (R 4.2.3)
## lattice 0.20-45 2021-09-22 [2] CRAN (R 4.2.3)
## lifecycle 1.0.4 2023-11-07 [2] CRAN (R 4.2.3)
## lubridate * 1.9.3 2023-09-27 [2] CRAN (R 4.2.3)
## magrittr 2.0.3 2022-03-30 [2] CRAN (R 4.2.3)
## MASS 7.3-58.2 2023-01-23 [2] CRAN (R 4.2.3)
## Matrix 1.6-4 2023-11-30 [2] CRAN (R 4.2.3)
## memoise 2.0.1 2021-11-26 [2] CRAN (R 4.2.3)
## mgcv 1.8-42 2023-03-02 [2] CRAN (R 4.2.3)
## mime 0.12 2021-09-28 [2] CRAN (R 4.2.3)
## miniUI 0.1.1.1 2018-05-18 [2] CRAN (R 4.2.3)
## multtest 2.54.0 2022-11-01 [1] Bioconductor
## munsell 0.5.1 2024-04-01 [2] CRAN (R 4.2.3)
## nlme 3.1-162 2023-01-31 [2] CRAN (R 4.2.3)
## pacman 0.5.1 2019-03-11 [1] CRAN (R 4.2.3)
## patchwork * 1.3.0.9000 2025-02-19 [1] Github (thomasp85/patchwork@2695a9f)
## permute 0.9-7 2022-01-27 [1] CRAN (R 4.2.3)
## phyloseq * 1.42.0 2022-11-01 [1] Bioconductor
## pillar 1.10.1 2025-01-07 [1] CRAN (R 4.2.3)
## pkgbuild 1.4.5 2024-10-28 [2] CRAN (R 4.2.3)
## pkgconfig 2.0.3 2019-09-22 [2] CRAN (R 4.2.3)
## pkgload 1.4.0 2024-06-28 [2] CRAN (R 4.2.3)
## plyr 1.8.9 2023-10-02 [2] CRAN (R 4.2.3)
## profvis 0.4.0 2024-09-20 [2] CRAN (R 4.2.3)
## promises 1.3.0 2024-04-05 [2] CRAN (R 4.2.3)
## purrr * 1.0.2 2023-08-10 [2] CRAN (R 4.2.3)
## R6 2.6.1 2025-02-15 [1] CRAN (R 4.2.3)
## Rcpp 1.0.13-1 2024-11-02 [2] CRAN (R 4.2.3)
## RCurl 1.98-1.16 2024-07-11 [2] CRAN (R 4.2.3)
## readr * 2.1.5 2024-01-10 [2] CRAN (R 4.2.3)
## remotes 2.5.0 2024-03-17 [2] CRAN (R 4.2.3)
## reshape2 1.4.4 2020-04-09 [2] CRAN (R 4.2.3)
## rhdf5 2.42.1 2023-04-07 [1] Bioconductor
## rhdf5filters 1.10.1 2023-03-24 [1] Bioconductor
## Rhdf5lib 1.20.0 2022-11-01 [1] Bioconductor
## rlang 1.1.5 2025-01-17 [1] CRAN (R 4.2.3)
## rmarkdown 2.29 2024-11-04 [2] CRAN (R 4.2.3)
## rstudioapi 0.17.1 2024-10-22 [2] CRAN (R 4.2.3)
## S4Vectors 0.36.2 2023-02-26 [2] Bioconductor
## sass 0.4.9 2024-03-15 [2] CRAN (R 4.2.3)
## scales 1.3.0 2023-11-28 [2] CRAN (R 4.2.3)
## sessioninfo 1.2.2 2021-12-06 [2] CRAN (R 4.2.3)
## shiny 1.9.1 2024-08-01 [2] CRAN (R 4.2.3)
## stringi 1.8.4 2024-05-06 [2] CRAN (R 4.2.3)
## stringr * 1.5.1 2023-11-14 [2] CRAN (R 4.2.3)
## survival 3.5-3 2023-02-12 [2] CRAN (R 4.2.3)
## tibble * 3.2.1 2023-03-20 [2] CRAN (R 4.2.3)
## tidyr * 1.3.1 2024-01-24 [2] CRAN (R 4.2.3)
## tidyselect 1.2.1 2024-03-11 [2] CRAN (R 4.2.3)
## tidyverse * 2.0.0 2023-02-22 [2] CRAN (R 4.2.3)
## timechange 0.3.0 2024-01-18 [2] CRAN (R 4.2.3)
## tzdb 0.4.0 2023-05-12 [2] CRAN (R 4.2.3)
## urlchecker 1.0.1 2021-11-30 [2] CRAN (R 4.2.3)
## usethis * 3.0.0 2024-07-29 [2] CRAN (R 4.2.3)
## vctrs 0.6.5 2023-12-01 [2] CRAN (R 4.2.3)
## vegan 2.6-10 2025-01-29 [1] CRAN (R 4.2.3)
## withr 3.0.2 2024-10-28 [2] CRAN (R 4.2.3)
## xfun 0.49 2024-10-31 [2] CRAN (R 4.2.3)
## xtable 1.8-4 2019-04-21 [2] CRAN (R 4.2.3)
## XVector 0.38.0 2022-11-01 [2] Bioconductor
## yaml 2.3.10 2024-07-26 [2] CRAN (R 4.2.3)
## zlibbioc 1.44.0 2022-11-01 [2] Bioconductor
##
## [1] /home/jt698/R/x86_64-pc-linux-gnu-library/4.2
## [2] /programs/R-4.2.3/lib64/R/library
##
## ──────────────────────────────────────────────────────────────────────────────